Why we are doing this study?

This is a study case used as a capstone for the Google Data Analytics certificate Problem: My favorite activity in the world is sleeping. Could I optimize it by controlling the different variables that might affect it?

Solution: Let’s dive into a sleeping study to discover which variables will improve or ruin a good night sleep!

Statistical Information

library(tidyverse)
library(ggplot2)
library(patchwork)
library(ggpubr)
data <- read.csv("sleepdata.csv")
head(data)
##   Person.ID Gender Age           Occupation Sleep.Duration Quality.of.Sleep
## 1         1   Male  27    Software Engineer            6.1                6
## 2         2   Male  28               Doctor            6.2                6
## 3         3   Male  28               Doctor            6.2                6
## 4         4   Male  28 Sales Representative            5.9                4
## 5         5   Male  28 Sales Representative            5.9                4
## 6         6   Male  28    Software Engineer            5.9                4
##   Physical.Activity.Level Stress.Level BMI.Category Blood.Pressure Heart.Rate
## 1                      42            6   Overweight         126/83         77
## 2                      60            8       Normal         125/80         75
## 3                      60            8       Normal         125/80         75
## 4                      30            8        Obese         140/90         85
## 5                      30            8        Obese         140/90         85
## 6                      30            8        Obese         140/90         85
##   Daily.Steps Sleep.Disorder
## 1        4200           None
## 2       10000           None
## 3       10000           None
## 4        3000    Sleep Apnea
## 5        3000    Sleep Apnea
## 6        3000       Insomnia
summary(data)
##    Person.ID         Gender               Age         Occupation       
##  Min.   :  1.00   Length:374         Min.   :27.00   Length:374        
##  1st Qu.: 94.25   Class :character   1st Qu.:35.25   Class :character  
##  Median :187.50   Mode  :character   Median :43.00   Mode  :character  
##  Mean   :187.50                      Mean   :42.18                     
##  3rd Qu.:280.75                      3rd Qu.:50.00                     
##  Max.   :374.00                      Max.   :59.00                     
##  Sleep.Duration  Quality.of.Sleep Physical.Activity.Level  Stress.Level  
##  Min.   :5.800   Min.   :4.000    Min.   :30.00           Min.   :3.000  
##  1st Qu.:6.400   1st Qu.:6.000    1st Qu.:45.00           1st Qu.:4.000  
##  Median :7.200   Median :7.000    Median :60.00           Median :5.000  
##  Mean   :7.132   Mean   :7.313    Mean   :59.17           Mean   :5.385  
##  3rd Qu.:7.800   3rd Qu.:8.000    3rd Qu.:75.00           3rd Qu.:7.000  
##  Max.   :8.500   Max.   :9.000    Max.   :90.00           Max.   :8.000  
##  BMI.Category       Blood.Pressure       Heart.Rate     Daily.Steps   
##  Length:374         Length:374         Min.   :65.00   Min.   : 3000  
##  Class :character   Class :character   1st Qu.:68.00   1st Qu.: 5600  
##  Mode  :character   Mode  :character   Median :70.00   Median : 7000  
##                                        Mean   :70.17   Mean   : 6817  
##                                        3rd Qu.:72.00   3rd Qu.: 8000  
##                                        Max.   :86.00   Max.   :10000  
##  Sleep.Disorder    
##  Length:374        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
#Number of null values per column
colSums(is.na(data))
##               Person.ID                  Gender                     Age 
##                       0                       0                       0 
##              Occupation          Sleep.Duration        Quality.of.Sleep 
##                       0                       0                       0 
## Physical.Activity.Level            Stress.Level            BMI.Category 
##                       0                       0                       0 
##          Blood.Pressure              Heart.Rate             Daily.Steps 
##                       0                       0                       0 
##          Sleep.Disorder 
##                       0
gen_avg <- data %>% 
  group_by(Gender) %>%
  summarise(mean_age = mean(Age), n = n())

gen_avg
## # A tibble: 2 × 3
##   Gender mean_age     n
##   <chr>     <dbl> <int>
## 1 Female     47.4   185
## 2 Male       37.1   189

The average age of Women is higher. This could skew the data as age could be an important factor in sleep quality. But we do have a 50/50 gender representation.

Exploratory Data Analysis (EDA)

data$Sleep.Disorder = factor(data$Sleep.Disorder, levels = c('None','Insomnia','Sleep Apnea'))

df <- data %>% 
  group_by(Sleep.Disorder) %>% # Variable to be transformed
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))

ggplot(df, aes(x = "", y = perc, fill = Sleep.Disorder)) +
  geom_col() +
  geom_label(aes(label = labels), color = "black",
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) +
  scale_fill_brewer(palette = "Reds") +  
  coord_polar("y", start = 0) +  
  theme_void() +
  ggtitle("Percentage of Sleep Disorders") +
  theme(plot.title = element_text(hjust=0.5))

We do not have an equal representation of every sleep class, but we almost have a 60/40 representation of healthy and disease respectively.

my_comparisons <- list( c("None", "Insomnia"), 
                        c("Insomnia", "Sleep Apnea"), 
                        c("None", "Sleep Apnea") )

ggplot(data, aes(x=Sleep.Disorder, y=Quality.of.Sleep, fill=Sleep.Disorder)) +
  geom_boxplot()+
  geom_jitter(shape=16, position=position_jitter(0.2)) +
  scale_fill_brewer(palette="Reds") +
  stat_summary(fun = "mean", geom = "point", shape = 8,
               size = 2, color = "white")+ 
  stat_compare_means(comparisons = my_comparisons)+ # Add pairwise comparisons p-value
  stat_compare_means(label.y = 12) + # Add global p-value
  ggtitle("Quality fo Sleep score distribution \n per Sleep Disorder") +
  theme(plot.title = element_text(hjust=0.5))     

The average quality of aleep score for a person without a disorder is ~7.5. Sleep apnea does not seem to be correlated with a lower quality of sleep, we can see no significant difference between the Sleep Apnea and No Disorder classes. Insomnia is the only class where we see a significant loss in quality of sleep. The quality of sleep metric seems to be very subjective as it does not definitely indicate a sleep disorder or not.

gend_group <- data %>%
  group_by(Gender,Sleep.Disorder,Age)

gender_bar <- ggplot(data, aes(x=Gender, y=Age, fill=Sleep.Disorder)) +
  scale_fill_brewer(palette="Reds") +
  geom_bar(stat="summary", fun.y = "mean", position="dodge") +
  ggtitle("Sleep Disorders per Gender")+
  theme(plot.title = element_text(hjust=0.5))     

gender_bar 

As stated before, the female group has a higher average age and seems more affected by sleep apnea that the male group, while the male group seems the most affected by insomnia.

data$BMI.Category <- gsub("Normal Weight", "Normal", data$BMI.Category)
data$BMI.Category <- factor(data$BMI.Category, labels=c('Normal','Overweight','Obese'))

bmi_bar <- ggplot(data, aes(x=BMI.Category, fill=Sleep.Disorder)) +
  scale_fill_brewer(palette="Reds") +
  geom_bar() +
  ggtitle("BMI effect on Sleep") +
  theme(plot.title = element_text(hjust=0.5))
bmi_bar

We cleaned up this column by merging “Normal” and “Normal Weight” samples. The overweight category seems severely under represented.

A conclusion we can pull from this figure is that BMI is correlated with sleep quality. A higher BMI tends to indicate insomnia and/or sleep apnea, while in the lowest BMI category has a majority of normal sleep. Obese class has almost 50/50 split on both sleep disorders. We cannot make an conclusions on the Overweight category due to the small sample size but it does seem to indicate a correlation with sleep disorders.

data_stress <- data %>%
  group_by(Occupation) %>%
  summarize(avg_stress = mean(Stress.Level),
            avg_age = mean(Age),
            count = n())
  
occup_point <- ggplot(data, aes(x=Occupation, y=Stress.Level)) +
  geom_point(aes(fill=Sleep.Disorder,size=Age), color='black', shape=21, stroke=0.4) +
  scale_fill_brewer(palette="Reds") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5),
        plot.margin = margin(t = 5, r = 10, b = 5, l = 10),
        axis.title.x = element_text(margin = margin(t = 20)),
        plot.title = element_text(hjust=0.5)) +
  ggtitle("Occupation, Age and Stress \n effect on Sleep") 

occup_bar <- ggplot(data, aes(x=Occupation, fill=Sleep.Disorder)) +
  geom_bar() +
  scale_fill_brewer(palette="Reds")+
  ggtitle("Occupation \n effect on Sleep") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.7),
        plot.margin = margin(t = 5, r = 10, b = 5, l = 10),
        axis.title.x = element_text(margin = margin(t = 20)),
        plot.title = element_text(hjust=0.5)) 

print(data_stress)
## # A tibble: 11 × 4
##    Occupation           avg_stress avg_age count
##    <chr>                     <dbl>   <dbl> <int>
##  1 Accountant                 4.59    39.6    37
##  2 Doctor                     6.73    32.7    71
##  3 Engineer                   3.89    46.6    63
##  4 Lawyer                     5.06    39.4    47
##  5 Manager                    5       45       1
##  6 Nurse                      5.55    51.8    73
##  7 Sales Representative       8       28       2
##  8 Salesperson                7       43.5    32
##  9 Scientist                  7       33.5     4
## 10 Software Engineer          6       31.2     4
## 11 Teacher                    4.53    41.7    40
occup_bar + occup_point + plot_layout(widths = c(1, 1))

The occupations with the highest stress scores seem to be the most correlated with sleep disorders, as illustrated by Sales representative, Salespersons and Scientists, especially causing sleep apnea. The occupations with a lower average stress score but a higher age group average, like nurses and teachers, also show an increase in the number of sleep disorders, even when the stress score is on the low end (see Nurses with score of 3).

BP_ranges = c('Normal', 'Elevated', 'Hypertension_1', 'Hypertension_2', 'Hypertensive_Crisis')
BP_systolic_limits = list(c(0,120),c(120,130),c(130,140),c(140,180),c(180,200))
BP_diastolic_limits = list(c(0,80),c(0,80),c(80,90),c(90,120),c(120,140))

# Categorize into BP ranges depending on the systolic and diastolic values 
data <- data %>%
  separate(Blood.Pressure, into=c('Systolic','Diastolic'), sep="/") %>%
  mutate(Systolic = as.numeric(Systolic),
         Diastolic = as.numeric(Diastolic)) %>%
  mutate(BP_range = case_when(
    between(Systolic, BP_systolic_limits[[1]][1], BP_systolic_limits[[1]][2]) & 
      between(Diastolic, BP_diastolic_limits[[1]][1], BP_diastolic_limits[[1]][2]) ~ BP_ranges[1],
    
    between(Systolic, BP_systolic_limits[[2]][1], BP_systolic_limits[[2]][2]) & 
      between(Diastolic, BP_diastolic_limits[[2]][1], BP_diastolic_limits[[2]][2]) ~ BP_ranges[2],
    
    between(Systolic, BP_systolic_limits[[3]][1], BP_systolic_limits[[3]][2]) & 
      between(Diastolic, BP_diastolic_limits[[3]][1], BP_diastolic_limits[[3]][2]) ~ BP_ranges[3],
    
    between(Systolic, BP_systolic_limits[[4]][1], BP_systolic_limits[[4]][2]) & 
      between(Diastolic, BP_diastolic_limits[[4]][1], BP_diastolic_limits[[4]][2]) ~ BP_ranges[4],
    
    between(Systolic, BP_systolic_limits[[5]][1], BP_systolic_limits[[5]][2]) & 
      between(Diastolic, BP_diastolic_limits[[5]][1], BP_diastolic_limits[[5]][2]) ~ BP_ranges[5],
    
    TRUE ~ "Unknown" # Default case
  ))


# New values to replace "Unknown"
replacement_values <- c(rep("Elevated", 12), rep("Hypertension_1", 2), "Elevated")

data <- data %>%
  mutate(BP_range = replace(BP_range, BP_range == "Unknown", replacement_values)) %>%
  mutate(Heart.Rate = as.factor(Heart.Rate))

data$Heart.Rate <- as.numeric(as.character(data$Heart.Rate))
# Group by BP_range and Heart.Rate, then count occurrences
data_grouped <- data %>%
  group_by(BP_range, Heart.Rate, Sleep.Disorder, Age) %>%
  summarise(Count = n(), .groups = "drop")  # Count occurrences of each Heart Rate

# Convert BP_range into a factor with the specified order
data_grouped$BP_range <- factor(data_grouped$BP_range, levels = BP_ranges)

blood_bar <- ggplot(data_grouped, aes(x = BP_range, y = Count, fill = Heart.Rate)) +
  geom_col() +  
  scale_fill_viridis_c(option = "magma", direction = -1) +  
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))+
  ggtitle("Heart rate of \n of blood pressure ranges") +
  theme(plot.title = element_text(hjust=0.5))

blood_bar

We separated the “Blood.Pressure” column into “Systolic” (#/) and “Diastolic” (/#) categories and then depending on their combination create a categorical blood pressure range column: “BP_range” from the American Heart Association . We manually clean up the “Unknown” ranges by making assumptions (putting more importance on the systolic number) and assigning them a category.

As expected, we can see an increase in heart rate with each blood pressure range.

data_grouped$Sleep.Disorder = factor(data_grouped$Sleep.Disorder, levels = 
                                       c('None','Insomnia','Sleep Apnea'))

blood_sleep <- ggplot(data_grouped, aes(x=BP_range, y=Heart.Rate)) + 
  geom_jitter(aes(fill=Sleep.Disorder,size=Age), color='black', 
              shape=21, stroke=0.4, width=0.3) + 
  scale_fill_brewer(palette="Reds") +
  ggtitle("Effect of BP range, Heart Rate \n and Age on Sleep") +
  theme(plot.title = element_text(hjust=0.5))

blood_sleep

We see an evident correlation between hypertension and sleep disorders. Regardless of age, a lower heart rate and a normal BP range indicates a lower propensity to sleep disorders, while the hypertension I and II have a higher proportion of insomnia and sleep apnea, with hypertension II being almost exclusively being sleep apnea. An elevated BP does not seem correlated with sleep disorders.

data <- data %>% 
  mutate(Stress.Level = as.factor(Stress.Level))

act_bar <- ggplot(data, aes(x=Physical.Activity.Level,y=Daily.Steps)) +
  geom_jitter(aes(size=Age, fill=Stress.Level), color='black', shape=21, stroke=0.4) +
  scale_fill_brewer(palette="Blues") +
  geom_smooth(method=lm,  linetype="dashed",
              color="darkblue")

act_stress_bar <- ggplot(data, aes(x=Physical.Activity.Level,y=Daily.Steps)) +
  geom_jitter(aes(fill=Sleep.Disorder, size=Age), color='black', shape=21, stroke=0.4) +
  scale_fill_brewer(palette="Reds") +
  geom_smooth(method=lm,  linetype="dashed",
              color="darkred")
act_bar + act_stress_bar + plot_layout(widths = c(1, 1))

We see a really strong correlation between physical activity and daily steps with the stress level, the lower the physical activity and daily steps, the higher the stress level the person feels regardless of their age. We see a couple of outliers where the stress levels are high at the top right (highest number of steps with highest activity level). As seen by the following table:

data %>%
  group_by(Occupation, Stress.Level) %>%
  summarise(avg_steps = mean(Daily.Steps),
            avg_activity = mean(Physical.Activity.Level))
## # A tibble: 32 × 4
## # Groups:   Occupation [11]
##    Occupation Stress.Level avg_steps avg_activity
##    <chr>      <fct>            <dbl>        <dbl>
##  1 Accountant 3                7500          80  
##  2 Accountant 4                7000          60  
##  3 Accountant 6                7200          53.3
##  4 Accountant 7                6000          45  
##  5 Doctor     3                6850          87.5
##  6 Doctor     5                3500          65  
##  7 Doctor     6                8000          75  
##  8 Doctor     8                5848.         31.8
##  9 Engineer   3                5176.         30.9
## 10 Engineer   4                5644.         63.9
## # ℹ 22 more rows

The higher number of steps could be a result of the job and not a result of leisure sport, especially with nurses who have the highest number of steps over all the other occupations (8058 steps).

Insomnia seems to be correlated with a lower level of physical, while sleep apnea does not follow this pattern. The latter one seems to mainly affect older people regardless of the physical activity.

I

sleep_q <- ggplot(data, aes(x=Sleep.Duration,y=Quality.of.Sleep)) +
  geom_jitter(aes(fill=Sleep.Disorder), color='black', 
              shape=21, stroke=0.4, size=3, width=0.5) +  
  scale_fill_brewer(palette="Reds")

sleep_q

Conclusions

The goal of this study case was to use data analytics to derive insights, however I want to also apply some machine learning to obtain an order of feature importance and to see if this data is enough to predict a sleep disorder

Pre-processing data

Regression model

Conclusions